NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL
Which method is best to select values present in one table but missing in another one?
This:
SELECT l.* FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL, this:
SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right r )or this:
SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value )
Finally, it's MySQL time.
As always, we will create the sample tables:
Table t_left
contains 100,000 rows with 10,000 distinct values.
Table t_right
contains 1,000,000 rows with 10,000 distinct values.
There are 10 rows in t_left
with values not present in t_right
.
In both tables the field value
is indexed.
LEFT JOIN / IS NULL
SELECT l.id, l.value FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL
View query results and execution plan
The query returns correct results in 0.73 seconds.
If we look into the Extra part of the execution plan, we will see an interesting thing there:
Using where; Using index; Not exists
What does it mean?
MySQL documentation mentions this:
Not exists
MySQL was able to do a
LEFT JOIN
optimization on the query and does not examine more rows in this table for the previous row combination after it finds one row that matches theLEFT JOIN
criteria. Here is an example of the type of query that can be optimized this way:SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE t2.id IS NULL;Assume that
t2.id
is defined asNOT NULL
. In this case, MySQL scanst1
and looks up the rows int2
using the values oft1.id
. If MySQL finds a matching row int2
, it knows thatt2.id
can never beNULL
, and does not scan through the rest of the rows int2
that have the same id value. In other words, for each row int1
, MySQL needs to do only a single lookup int2
, regardless of how many rows actually match int2
.
This is exactly our case.
MySQL, as well as all other systems except SQL Server, is able to optimize LEFT JOIN / IS NULL
to return FALSE
as soon the matching value is found, and it is the only system that cared to document this behavior.
The assumption that t2.id
should be defined as NOT NULL
, however, is too strong, since a successfull JOIN
on equality condition implies that the value found is NOT NULL
.
Since MySQL is not capable of using HASH
and MERGE
join algorithms, the only ANTI JOIN
it is capable of is the NESTED LOOPS ANTI JOIN
, which is exactly what we see in the query plan (despite the fact that MySQL doesn't call it that). However, this behavior is what an ANTI JOIN
does: it checks the values from the left table against only one of each distinct values in the right table, skipping the duplicates.
NOT IN
SELECT l.id, l.value FROM t_left l WHERE value NOT IN ( SELECT value FROM t_right )
View query results and execution plan
This query is as fast as the LEFT JOIN / NOT NULL
, however its plan looks quite different.
First, it mentions a DEPENDENT SUBQUERY
instead of a second table (which was used in the LEFT JOIN / IS NULL
). Nominally, is a dependent subquery indeed, since we don't have a join here but rather a SELECT
from a single table with predicate in the WHERE
clause.
Second, the description part of the plan mentions this:
<exists>(<index_lookup>(<cache>(`20090918_anti`.`l`.`value`) in t_right on ix_right_value))
What is that?
This is of course our good old friend, the ANTI JOIN
. MySQL applies EXISTS
optimization to the subquery: it uses the index scan over ix_right_value
and returns as soon as it finds (or not finds) a row.
NOT IN
is different in how it handles NULL
values. However, since both t_left.value
and t_right.value
are defined as NOT NULL
, no NULL
value can ever be returned by this predicate, and MySQL takes this into account.
Essentially, this is exactly the same plan that LEFT JOIN / IS NULL
uses, despite the fact these plans are executed by the different branches of code and they look different in the results of EXPLAIN
. The algorithms are in fact the same in fact and the queries complete in same time.
NOT EXISTS
SELECT l.id, l.value FROM t_left l WHERE NOT EXISTS ( SELECT value FROM t_right r WHERE r.value = l.value )
View query results and execution plan
This query of course produces the same results.
Execution plan, again, is different. MySQL is the only system that produces three different plans for three different methods.
The plan does not differ much: MySQL does know what an index lookup is and what EXISTS
is and it does combine them together.
EXISTS
in MySQL is optimized so that it returns as soon as the first value is found. So this query in fact is an ANTI JOIN
as well as first two queries are.
This query, however, is a little bit less efficient than the previous two: it takes 0.92 s.
This is not much of a performance drop, however, the query takes 27% more time.
It's hard to tell exact reason for this, since this drop is linear and does not seem to depend on data distribution, number of values in both tables etc., as long as both fields are indexed. Since there are three pieces of code in MySQL that essentialy do one job, it is possible that the code responsible for EXISTS
makes some kind of an extra check which takes extra time.
Summary
MySQL can optimize all three methods to do a sort of NESTED LOOPS ANTI JOIN
.
It will take each value from t_left
and look it up in the index on t_right.value
. In case of an index hit or an index miss, the corresponding predicate will immediately return FALSE
or TRUE
, respectively, and the decision to return the row from t_left
or not will be made immediately without examining other rows in t_right
.
However, these three methods generate three different plans which are executed by three different pieces of code. The code that executes EXISTS
predicate is about 30% less efficient than those that execute index_subquery
and LEFT JOIN
optimized to use Not exists
method.
That's why the best way to search for missing values in MySQL is using a LEFT JOIN / IS NULL
or NOT IN
rather than NOT EXISTS
.
Great article, which I’ve often cited in answers on SO. However, I’m struggling to understand how your conclusion reconciles with [Optimizing Subqueries with `EXISTS` Strategy][1] which (to my reading) suggests that `NOT EXISTS` should be *more efficient* than `NOT IN`? Grateful for your thoughts.
eggyal
24 Dec 12 at 01:04
Great article, which I’ve often cited in answers on SO. However, I’m struggling to understand how your conclusion reconciles with [Optimizing Subqueries with `EXISTS` Strategy][1] which (to my reading) suggests that `NOT EXISTS` should be *more efficient* than `NOT IN`? Grateful for your thoughts.
[1]: http://dev.mysql.com/doc/en/subquery-optimization-with-exists.html
eggyal
24 Dec 12 at 01:05
@eggyal: first,
NOT EXISTS
andEXISTS
(anti-join and semi-join) are very different things using different plans. The former cannot be efficiently rewritten using the outer table leading, not with nested loops, the second can.Second, the article you’re linking concerns the difference in handling
NULL
values (EXISTS
is bivalent,IN
is trivalent). This is a very interesting subject however I did not cover it in this article.Quassnoi
24 Dec 12 at 15:50
Hi,
Its a Great analysis, whichever i saw in forums. Its really practical examples are used and every one will understand easily.
I were never used not existing, But i got a damn information from these article. really great and really appreciated…
Thanks alot…
Ashraf K C
30 Jan 13 at 13:36
what’s wrong with this?
SELECT
c.id_category,
c.category
FROM category as c
LEFT JOIN item_category_assign as i ON i.id_category = c.id_category
WHERE i.id_item IS NULL AND t.type = ‘digital’
this return 0 rows, can you explain that?
dhamaso
19 Apr 13 at 01:19
@dhamaso: what’s
t.type
? You don’t have a table called or aliasedt
in your query.Quassnoi
19 Apr 13 at 01:21
Sorry is i.type
SELECT
c.id_category,
c.category
FROM category as c
LEFT JOIN item_category_assign as i ON i.id_category = c.id_category
WHERE i.id_item IS NULL AND i.type = ‘digital’
dhamaso
31 May 13 at 22:12
@dhamaso: move
i.type
into the ON clause:SELECT c.id_category, c.category
FROM category c
LEFT JOIN
item_category_assign i
ON i.id_category = c.id_category
AND i.type = 'digital'
WHERE i.id_item IS NULL
When
id_item
is a null due the left join, all other fields are NULL as well.Quassnoi
31 May 13 at 22:16
look at the two sqls:
select u.* from cms_points_member m,cms_users u left join cms_users_send_sms_queue q on u.users_id=q.users_id where u.relation_id=m.relation_id and q.users_id IS NULL and q.message_id=1
Time:0.0008
This sql has not result!
SELECT u. *
FROM cms_users u, cms_points_member m
WHERE u.relation_id = m.relation_id
AND NOT
EXISTS (
SELECT 1
FROM cms_users_send_sms_queue q
WHERE q.users_id = u.users_id
AND q.message_id =1
)
Time:0.0012
This sql has result !
If the two sqls is the same meaning,who tell me why?
tonnycao
8 Jun 13 at 03:11
OK,I know the error .
SELECT u. *
FROM cms_points_member m, cms_users u
LEFT JOIN cms_users_send_sms_queue q ON u.users_id = q.users_id
AND q.message_id =1
WHERE u.relation_id = m.relation_id
AND q.users_id IS NULL
This is correct!
Thinks blogger.
tonnycao
8 Jun 13 at 03:16
“value NOT IN (select value etc.) ”
is slow but it works great.
thank you !
michele
2 Sep 13 at 19:55
Thanks for your article~~!But I still wanna ask a question.Assume that t_left left join t_right by two conditions con1 and con2.There is an unique key created on con1 and con2 like unique key (con1,con2).So,will the execution plan still get a join type “ef_ref” for the following sql like :
SELECT *
FROM t1
LEFT JOIN
t2
ON t1.con1=t2.con1 and t1.con2=t2.con2
WHERE t2.con2 is NULL
Chris
16 Nov 13 at 10:32
@Chris: yes, it will
Quassnoi
16 Nov 13 at 21:54
I did the test I mentioned above.The execution plan did get a join type “ef_ref”.But you shoule make con1 and con2 the same data type in both t_left and t_right.For example, the same character set and collation when con1 and con2 are varchar
Chris
17 Nov 13 at 06:20
Excuse me, a topic out, but Who is better between “in” and “not in” if the values in “in” are a lot than values in “not in”?
thanks
example
1 case:field IN (3,6,7,8,44, …, 1052)
2 case field NOT IN (1,2)
oscar
22 Nov 13 at 12:28
@oscar: what do you mean by “who is better”? Please clarify your question.
Quassnoi
22 Nov 13 at 12:30
SELECT l.*
FROM t_left l
LEFT JOIN
t_right r
ON r.value = l.value
WHERE r.value IS NULL
This is more fast other than.
This is good post.
Mohan lal
3 Mar 15 at 15:17
hi is there someone who can help me according to this query.
//for getting data selected table
public function getUsers($tblName,$where){
$this->db->select(‘users.*,user_team.u_id’);
$this->db->from(‘users’);
$this->db->join(‘user_team’,’user_team.u_id!=users.u_id’);
$this->db->where($where);
$query = $this->db->get();
echo $this->db->last_query();
echo ”;
print_r($query->result());
exit;
if($query->num_rows() > 0){
return $query;
}
}
I need those users which u_id dosn’t match in user_team column u_id.
Global World
8 Jun 15 at 10:38
@GlobalWorld: please ask that on Stack Overflow and send me the link
Quassnoi
10 Jun 15 at 19:42
In versions before 5.6 (?), “IN ( SELECT … )” would repeatedly perform the subquery. This led to terrible performance.
In 5.6, the subquery is ‘materialized’ and an index is dynamically built. Both of these steps speed up the processing a lot, but not to the point of being as fast as LEFT JOIN.
Rick James
17 Jun 15 at 00:49
@RickJames: it’s not entirely true. It would repeatedly perform a subquery indeed, but not the subquery. It would push the predicate inside the subquery, so in our case
SELECT value FROM t_right WHERE t_right.value = t_left.value
is what would be called in a loop, not the originalSELECT value FROM t_right
. With proper indexing, it works quite well.Quassnoi
17 Jun 15 at 01:57
Great post. But after testing in my website, I got a strange result.. Screenshot http://i.imgur.com/jLvVnXX.jpg
I have cleared cache (FLUSH QUERY CACHE;) before querying.. dict_en_zhtw have 3.2 million rows, dict_en_zhtw2 have 1000 rows.
Query 1: 0.011 second
SELECT l.dict_id, l.dict_word FROM dict_en_zhtw l
WHERE l.dict_word NOT IN (SELECT dict_word FROM dict_en_zhtw2 r)
Query 2: 0.000 second ?
SELECT l.dict_id, l.dict_word FROM dict_en_zhtw l
LEFT JOIN dict_en_zhtw2 r
ON r.dict_word = l.dict_word
WHERE r.dict_word IS NULL
Query 3: 1.4551 second
SELECT dict_id, dict_word FROM dict_en_zhtw
WHERE dict_id NOT IN (SELECT l.dict_id FROM dict_en_zhtw AS l, dict_en_zhtw2 AS r WHERE r.dict_word = l.dict_word)
Query 4: 0.002 second
SELECT l.dict_id, l.dict_word FROM dict_en_zhtw l
WHERE NOT EXISTS (SELECT NULL FROM dict_en_zhtw2 r WHERE r.dict_word = l.dict_word)
Terry
16 Dec 15 at 15:55
My database is MariaDB 10.1.8, table format is innoDB.
Terry
16 Dec 15 at 15:59
@Terry: why do you join with dict_en_zhtw inside the IN in your query 3?
Quassnoi
22 Dec 15 at 12:03
Thank you for the article, it solved an empasse I was having.
Actually I got helped by doing the exact contrary of your conclusion:
“That’s why the best way to search for missing values in MySQL is using a LEFT JOIN / IS NULL or NOT IN rather than NOT EXISTS.”
I don’t agree it’s the best way, it’s the more response time efficient and it is the best only if that perfomance is an issue.
For example you didn’t measure the memory aspect of the query.
But most important to me, clarity of code was an issue because I had a stack of joins and negated conditions.
A 30% performance hit in a query executed once a day is equal to zero; in exchange I fixed a bug and left a cleaner, much more semantic code.
Developers are much slower than cpus and much more expensive.
Vittorio
1 Mar 17 at 22:33
There have been a number of improvements in the Optimizer since 2009.
To the Author (Quassnoi): Please rerun your tests, and specify which version they are being run against. Note also that MySQL and MariaDB may yield different results.
To the Reader: Test the variants yourself, do not blindly trust the conclusions in this blog.
Rick James
25 Oct 18 at 20:57
There is another MySQL join type called cross join. Can we used cross join in the scenario you have mentioned above? I think it should work to display the data. But I am new to MySQL so I might be mistaking as well.
Oliver Russell
10 Dec 18 at 16:15
@Oliver: no we can’t. CROSS JOIN yields the cartesian product of the tables or data sources which we can then filter on (effectively making it an INNER JOIN). There is no way to filter the product on “something which does not exist in another table” short of using the actual EXISTS or NOT IN clause, but we don’t need the CROSS JOIN at all then.
Quassnoi
10 Dec 18 at 22:01
Thank you dear.
In my case, “NOT IN”-query was the correct solution only. “NOT EXISTS” and “LEFT JOIN / IS NULL” queries were generally unsuccessful.
In my case:
t_left have more 2 000 000 rows.
t_right have more 8 000 000 rows.
There are 607568 rows in t_left with values not present in t_right.
field t_left.value varchar(40) NOT NULL
field t_left.value varchar(20) NOT NULL
I want select COUNT of rows in with values not present in t_right.
SELECT COUNT(*)
FROM t_left l
WHERE value NOT IN
(
SELECT value
FROM t_right
)
bunak
25 Jul 19 at 19:41
In the LEFT JOIN example above, don’t we need to use SELECT DISTINCT? If there is a match in the table we are checking, we should get exactly 1 row in the left semijoin and/or antijoin.
Ryan Rosario
16 Oct 22 at 02:47
I’m not really following what “exactly 1 row in the left semijoin” means. All three queries return one record from t_left for which there are no corresponding records in t_right. If there are duplicates in t_left, they will be returned by all three queries. If you don’t want the duplicates, use the DISTINCT, but you have to do this for all three methods, not just for LEFT JOIN.
Quassnoi
18 Oct 22 at 17:55